28 research outputs found

    Discovering the Interpretability-Performance Pareto Front of Decision Trees with Dynamic Programming

    Full text link
    Decision trees are known to be intrinsically interpretable as they can be inspected and interpreted by humans. Furthermore, recent hardware advances have rekindled an interest for optimal decision tree algorithms, that produce more accurate trees than the usual greedy approaches. However, these optimal algorithms return a single tree optimizing a hand defined interpretability-performance trade-off, obtained by specifying a maximum number of decision nodes, giving no further insights about the quality of this trade-off. In this paper, we propose a new Markov Decision Problem (MDP) formulation for finding optimal decision trees. The main interest of this formulation is that we can compute the optimal decision trees for several interpretability-performance trade-offs by solving a single dynamic program, letting the user choose a posteriori the tree that best suits their needs. Empirically, we show that our method is competitive with state-of-the-art algorithms in terms of accuracy and runtime while returning a whole set of trees on the interpretability-performance Pareto front

    A survey of preference-based reinforcement learning methods

    Get PDF
    Reinforcement learning (RL) techniques optimize the accumulated long-term reward of a suitably chosen reward function. However, designing such a reward function of ten requires a lot of task-specific prior knowledge. The designer needs to consider different objectives that do not only influence the learned behavior but also the learning progress. To alleviate these issues, preference-based reinforcement learning algorithms (PbRL) have been proposed that can directly learn from an expert\u27s preferences instead of a hand-designed numeric reward. PbRL has gained traction in recent years due to its ability to resolve the reward shaping problem, its ability to learn from non numeric rewards and the possibility to reduce the dependence on expert knowledge. We provide a unified framework for PbRL that describes the task formally and points out the different design principles that affect the evaluation task for the human as well as the computational complexity. The design principles include the type of feedback that is assumed, the representation that is learned to capture the preferences, the optimization problem that has to be solved as well as how the exploration/exploitation problem is tackled. Furthermore, we point out shortcomings of current algorithms, propose open research questions and briefly survey practical tasks that have been solved using PbRL

    A survey of preference-based reinforcement learning methods

    Get PDF
    Reinforcement learning (RL) techniques optimize the accumulated long-term reward of a suitably chosen reward function. However, designing such a reward function often requires a lot of task- specific prior knowledge. The designer needs to consider different objectives that do not only influence the learned behavior but also the learning progress. To alleviate these issues, preference-based reinforcement learning algorithms (PbRL) have been proposed that can directly learn from an expert's preferences instead of a hand-designed numeric reward. PbRL has gained traction in recent years due to its ability to resolve the reward shaping problem, its ability to learn from non numeric rewards and the possibility to reduce the dependence on expert knowledge. We provide a unified framework for PbRL that describes the task formally and points out the different design principles that affect the evaluation task for the human as well as the computational complexity. The design principles include the type of feedback that is assumed, the representation that is learned to capture the preferences, the optimization problem that has to be solved as well as how the exploration/exploitation problem is tackled. Furthermore, we point out shortcomings of current algorithms, propose open research questions and briefly survey practical tasks that have been solved using PbRL

    Direct Value Learning: a Rank-Invariant Approach to Reinforcement Learning

    Get PDF
    International audienceTaking inspiration from inverse reinforcement learning, the proposed Direct Value Learning for Reinforcement Learning (DIVA) approach uses light priors to gener-ate inappropriate behaviors, and uses the corresponding state sequences to directly learn a value function. When the transition model is known, this value function directly defines a (nearly) optimal controller. Otherwise, the value function is extended to the state-action space using off-policy learning. The experimental validation of DIVA on the mountain car problem shows the robustness of the approach comparatively to SARSA, based on the assumption that the target state is known. The experimental validation on the bicycle problem shows that DIVA still finds good policies when relaxing this assumption

    Projections for Approximate Policy Iteration Algorithms

    Get PDF
    Approximate policy iteration is a class of reinforcement learning (RL) algorithms where the policy is encoded using a function approximator and which has been especially prominent in RL with continuous action spaces. In this class of RL algorithms, ensuring increase of the policy return during policy update often requires to constrain the change in action distribution. Several approximations exist in the literature to solve this constrained policy update problem. In this paper, we propose to improve over such solutions by introducing a set of projections that transform the constrained problem into an unconstrained one which is then solved by standard gradient descent. Using these projections, we empirically demonstrate that our approach can improve the policy update solution and the control over exploration of existing approximate policy iteration algorithms

    Programming by Feedback

    Get PDF
    International audienceThis paper advocates a new ML-based programming framework, called Programming by Feedback (PF), which involves a sequence of interactions between the active computer and the user. The latter only provides preference judgments on pairs of solutions supplied by the active computer. The active computer involves one learning and one optimization components; the learning component estimates the user's utility function and accounts for the user's (possibly limited) competence; the optimization component explores the search space and returns the most appropriate candidate solution. A proof of principle of the approach is proposed, showing that PF requires a handful of interactions in order to solve some discrete and continuous benchmark problems

    Compatible natural gradient policy search

    Get PDF
    Trust-region methods have yielded state-of-the-art results in policy search. A common approach is to use KL-divergence to bound the region of trust resulting in a natural gradient policy update. We show that the natural gradient and trust region optimization are equivalent if we use the natural parameterization of a standard exponential policy distribution in combination with compatible value function approximation. Moreover, we show that standard natural gradient updates may reduce the entropy of the policy according to a wrong schedule leading to premature convergence. To control entropy reduction we introduce a new policy search method called compatible policy search (COPOS) which bounds entropy loss. The experimental results show that COPOS yields state-of-the-art results in challenging continuous control tasks and in discrete partially observable tasks

    Apprentissage par renforcement robuste reposant sur l'apprentissage par préférences

    No full text
    The thesis contributions resolves around sequential decision taking and more precisely Reinforcement Learning (RL). Taking its root in Machine Learning in the same way as supervised and unsupervised learning, RL quickly grow in popularity within the last two decades due to a handful of achievements on both the theoretical and applicative front. RL supposes that the learning agent and its environment follow a stochastic Markovian decision process over a state and action space. The process is said of decision as the agent is asked to choose at each time step an action to take. It is said stochastic as the effect of selecting a given action in a given state does not systematically yield the same state but rather defines a distribution over the state space. It is said to be Markovian as this distribution only depends on the current state-action pair. Consequently to the choice of an action, the agent receives a reward. The RL goal is then to solve the underlying optimization problem of finding the behaviour that maximizes the sum of rewards all along the interaction of the agent with its environment. From an applicative point of view, a large spectrum of problems can be cast onto an RL one, from Backgammon (TD-Gammon, was one of Machine Learning first success giving rise to a world class player of advanced level) to decision problems in the industrial and medical world. However, the optimization problem solved by RL depends on the prevous definition of a reward function that requires a certain level of domain expertise and also knowledge of the internal quirks of RL algorithms. As such, the first contribution of the thesis was to propose a learning framework that lightens the requirements made to the user. The latter does not need anymore to know the exact solution of the problem but to only be able to choose between two behaviours exhibited by the agent, the one that matches more closely the solution. Learning is interactive between the agent and the user and resolves around the three main following points: i) The agent demonstrates a behaviour ii) The user compares it w.r.t. to the current best one iii) The agent uses this feedback to update its preference model of the user and uses it to find the next behaviour to demonstrate. To reduce the number of required interactions before finding the optimal behaviour, the second contribution of the thesis was to define a theoretically sound criterion making the trade-off between the sometimes contradicting desires of complying with the user's preferences and demonstrating sufficiently different behaviours. The last contribution was to ensure the robustness of the algorithm w.r.t. the feedback errors that the user might make. Which happens more often than not in practice, especially at the initial phase of the interaction, when all the behaviours are far from the expected solution.Les contributions de la thèse sont centrées sur la prise de décisions séquentielles et plus spécialement sur l'Apprentissage par Renforcement (AR). Prenant sa source de l'apprentissage statistique au même titre que l'apprentissage supervisé et non-supervisé, l'AR a gagné en popularité ces deux dernières décennies en raisons de percées aussi bien applicatives que théoriques. L'AR suppose que l'agent (apprenant) ainsi que son environnement suivent un processus de décision stochastique Markovien sur un espace d'états et d'actions. Le processus est dit de décision parce que l'agent est appelé à choisir à chaque pas de temps du processus l'action à prendre. Il est dit stochastique parce que le choix d'une action donnée en un état donné n'implique pas le passage systématique à un état particulier mais définit plutôt une distribution sur l'espace d'états. Il est dit Markovien parce que cette distribution ne dépend que de l'état et de l'action courante. En conséquence d'un choix d'action, l'agent reçoit une récompense. Le but de l'AR est alors de résoudre le problème d'optimisation retournant le comportement qui assure à l'agent une récompense maximale tout au long de son interaction avec l'environnement. D'un point de vue pratique, un large éventail de problèmes peuvent être transformés en un problème d'AR, du Backgammon (cf. TD-Gammon, l'une des premières grandes réussites de l'AR et de l'apprentissage statistique en général, donnant lieu à un joueur expert de classe internationale) à des problèmes de décision dans le monde industriel ou médical. Seulement, le problème d'optimisation résolu par l'AR dépend de la définition préalable d'une fonction de récompense adéquate nécessitant une expertise certaine du domaine d'intérêt mais aussi du fonctionnement interne des algorithmes d'AR. En ce sens, la première contribution de la thèse a été de proposer un nouveau cadre d'apprentissage, allégeant les prérequis exigés à l'utilisateur. Ainsi, ce dernier n'a plus besoin de connaître la solution exacte du problème mais seulement de pouvoir désigner entre deux comportements, celui qui s'approche le plus de la solution. L'apprentissage se déroule en interaction entre l'utilisateur et l'agent. Cette interaction s'articule autour des trois points suivants : i) L'agent exhibe un nouveau comportement ii) l'expert le compare au meilleur comportement jusqu'à présent iii) l'agent utilise ce retour pour mettre à jour son modèle des préférences puis choisit le prochain comportement à démontrer. Afin de réduire le nombre d'interactions nécessaires entre l'utilisateur et l'agent pour que ce dernier trouve le comportement optimal, la seconde contribution de la thèse a été de définir un critère théoriquement justifié faisant le compromis entre les désirs parfois contradictoires de prendre en compte les préférences de l'utilisateur tout en exhibant des comportements suffisamment différents de ceux déjà proposés. La dernière contribution de la thèse est d'assurer la robustesse de l'algorithme face aux éventuelles erreurs d'appréciation de l'utilisateur. Ce qui arrive souvent en pratique, spécialement au début de l'interaction, quand tous les comportements proposés par l'agent sont loin de la solution attendue
    corecore